20 research outputs found

Influence des domaines de spécialité dans l'extraction de termes-clés

Authors: Boudin Florian, Bougouin Adrien, Daille Béatrice
Publication venue: HAL CCSD
Publication date: 01/07/2014
Audience: National

Keyphrases are the words or multi-word expressions that represent the main content of a document. They are useful for various applications, such as automatic indexing or automatic summarization, but are not always available. We therefore focus on automatic keyphrase extraction and, more specifically, on the difficulty of this task when processing documents from certain scientific disciplines. Using five corpora representing five different disciplines (archaeology, linguistics, information science, psychology and chemistry), we derive a scale of disciplinary difficulty and analyze the factors that influence this difficulty.

DivGraphPointer: A Graph Pointer Network for Extracting Diverse Keyphrases

Authors: Bougouin Adrien, Frank Eibe, Glorot Xavier, Ioffe Sergey, Kim Su Nam, Kim Youngsam, Kingma Diederik, Mihalcea Rada, Mikolov Tomáš, Qazvinian Vahed, Wan Xiaojun, Wang Yining
Publication venue: Association for Computing Machinery (ACM)
Publication date: 19/05/2019

Keyphrase extraction from documents is useful to a variety of applications such as information retrieval and document summarization. This paper presents an end-to-end method called DivGraphPointer for extracting a set of diversified keyphrases from a document. DivGraphPointer combines the advantages of traditional graph-based ranking methods and recent neural network-based approaches. Specifically, given a document, a word graph is constructed from the document based on word proximity and is encoded with graph convolutional networks, which effectively capture document-level word salience by modeling long-range dependency between words in the document and aggregating multiple appearances of identical words into one node. Furthermore, we propose a diversified point network to generate a set of diverse keyphrases out of the word graph in the decoding process. Experimental results on five benchmark data sets show that our proposed method significantly outperforms the existing state-of-the-art approaches. Comment: Accepted to SIGIR 201

arXiv.org e-Print Archive

Crossref

TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation

Authors: Barreaux Sabine, Boudin Florian, Bougouin Adrien, Daille Béatrice, Romary Laurent
Publication venue: HAL CCSD
Publication date: 23/05/2016
Audience: International

Keyphrase extraction is the task of finding phrases that represent the important content of a document. The main aim of keyphrase extraction is to propose textual units that represent the most important topics developed in a document. The output keyphrases of automatic keyphrase extraction methods for test documents are typically evaluated by comparing them to manually assigned reference keyphrases. Each output keyphrase is considered correct if it matches one of the reference keyphrases. However, the choice of the appropriate textual unit (keyphrase) for a topic is sometimes subjective, and evaluating by exact matching underestimates performance. This paper presents a dataset of evaluation scores assigned to automatically extracted keyphrases by human evaluators. Along with the reference keyphrases, the manual evaluations can be used to validate new evaluation measures. Indeed, an evaluation measure that is highly correlated with the manual evaluation is appropriate for the evaluation of automatic keyphrase extraction methods.
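
The exact-matching evaluation described in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `exact_match_scores` and the lowercase/whitespace normalization are assumptions made for illustration, and real evaluations often also apply stemming.

```python
def exact_match_scores(extracted, reference):
    """Return (precision, recall, F1) of extracted keyphrases under exact matching.

    Hypothetical helper for illustration; normalization is limited to
    lowercasing and collapsing whitespace (an assumption, not the paper's setup).
    """
    def normalize(phrase):
        return " ".join(phrase.lower().split())

    extracted_set = {normalize(p) for p in extracted}
    reference_set = {normalize(p) for p in reference}

    # An extracted keyphrase counts as correct only if it is identical
    # to a reference keyphrase after normalization.
    correct = len(extracted_set & reference_set)

    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    extracted = ["keyphrase extraction", "graph ranking", "evaluation"]
    reference = ["Keyphrase extraction", "graph-based ranking",
                 "evaluation measures", "evaluation"]
    print(exact_match_scores(extracted, reference))
```

In this example, "graph ranking" is counted as wrong although it is close to the reference "graph-based ranking", which is the kind of underestimation the abstract points out.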

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Indexation automatique par termes-clés en domaines de spécialité

Author: Bougouin Adrien
Publication venue: HAL CCSD
Publication date: 27/10/2015

Keyphrases are words or multi-word expressions that represent the content of a document. Keyphrases give a synoptic view of a document and help to index it for information retrieval. This Ph.D. thesis focuses on domain-specific automatic keyphrase annotation. Automatic keyphrase annotation is still a difficult task, and current systems do not achieve satisfactory results. Our work is divided into two steps. First, we propose a keyphrase candidate selection method that focuses on the categories of adjectives relevant within keyphrases, and we propose a method to rank the candidates according to their importance within the document. This ranking method, TopicRank, is a graph-based method that clusters keyphrase candidates into topics, ranks the topics and extracts one keyphrase per important topic. Our experiments show that TopicRank significantly outperforms other graph-based methods for automatic keyphrase annotation. Second, we focus on domain-specific documents and adapt our previous work. We study the best practice of manual keyphrase annotation by professional indexers and mimic it with a new method, TopicCoRank. TopicCoRank adds a new graph representing the specific domain to the topic graph of TopicRank. Leveraging this second graph, TopicCoRank possesses the rare ability to provide keyphrases that do not occur within documents. Applied to four corpora from four specific domains, TopicCoRank significantly outperforms TopicRank.

Thèses en Ligne

TopicRank : ordonnancement de sujets pour l'extraction automatique de termes-clés

Authors: Boudin Florian, Bougouin Adrien
Publication venue: Associacio catalana de Salut Laboral
Publication date: 01/01/2014
Audience: International

Keyphrases are single or multi-word expressions that represent the main content of a document. As keyphrases are useful in many applications such as document indexing or text summarization, and also because the vast amount of data available nowadays cannot be manually annotated, the task of automatically extracting keyphrases has attracted considerable attention. In this article we present TopicRank, an unsupervised graph-based method for keyphrase extraction. This method clusters the keyphrase candidates into topics, ranks these topics and extracts the most representative candidate for each of the best topics. Our experiments show a significant improvement over the state-of-the-art graph-based methods for keyphrase extraction.
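
The three steps named in this abstract (cluster candidates into topics, rank the topics, extract one candidate per top topic) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation. The greedy word-overlap clustering, the 0.25 threshold, the inverse-distance edge weights, the PageRank-style iteration and the "first occurring candidate" selection are all assumptions chosen to keep the sketch short, and every function name is hypothetical.

```python
from itertools import combinations


def _overlap(a, b):
    """Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_candidates(candidates, threshold=0.25):
    """Greedily group candidates that share enough words (assumed clustering)."""
    topics = []
    for cand in candidates:
        words = set(cand.lower().split())
        for topic in topics:
            if any(_overlap(words, set(o.lower().split())) >= threshold
                   for o in topic):
                topic.append(cand)
                break
        else:
            topics.append([cand])
    return topics


def rank_topics(topics, positions, damping=0.85, iterations=50):
    """Score topics with a weighted PageRank-style iteration on a topic graph."""
    n = len(topics)
    weight = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Assumed edge weight: topics whose candidates occur close together
        # in the document are more strongly connected.
        w = sum(1.0 / abs(pi - pj)
                for ci in topics[i] for cj in topics[j]
                for pi in positions[ci] for pj in positions[cj]
                if pi != pj)
        weight[i][j] = weight[j][i] = w
    out = [sum(row) for row in weight]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weight[j][i] / out[j] * scores[j]
                for j in range(n) if j != i and out[j] > 0)
            for i in range(n)
        ]
    return scores


def extract_keyphrases(positions, top_n=3):
    """positions maps each candidate phrase to its word offsets in the document."""
    candidates = sorted(positions, key=lambda c: min(positions[c]))
    topics = cluster_candidates(candidates)
    scores = rank_topics(topics, positions)
    best = sorted(range(len(topics)), key=lambda i: scores[i], reverse=True)
    # Assumed selection rule: keep the earliest-occurring candidate of each topic.
    return [min(topics[i], key=lambda c: min(positions[c])) for i in best[:top_n]]


if __name__ == "__main__":
    example = {
        "keyphrase extraction": [0, 40],
        "automatic keyphrase extraction": [12],
        "document indexing": [5],
        "graph ranking": [20],
        "topics": [30],
    }
    print(extract_keyphrases(example))
```

Because each topic contributes at most one keyphrase, near-duplicates such as "keyphrase extraction" and "automatic keyphrase extraction" cannot both appear in the output.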

Keyphrase Annotation with Graph Co-Ranking

Authors: Boudin Florian, Bougouin Adrien, Daille Béatrice
Publication venue: HAL CCSD
Publication date: 11/12/2016
Audience: International

Keyphrase annotation is the task of identifying textual units that represent the main content of a document. Keyphrase annotation is carried out either by extracting the most important phrases from a document (keyphrase extraction) or by assigning entries from a controlled domain-specific vocabulary (keyphrase assignment). Assignment methods are generally more reliable. They provide better-formed keyphrases, as well as keyphrases that do not occur in the document. However, they are often silent, unlike extraction methods, which do not depend on manually built resources. This paper proposes a new method to perform both keyphrase extraction and keyphrase assignment in an integrated and mutually reinforcing manner. Experiments have been carried out on datasets covering different domains of the humanities and social sciences. They show statistically significant improvements over state-of-the-art methods for both keyphrase extraction and keyphrase assignment.

TopicRank en domaines de spécialité : participation du LINA à DEFT 2016

Authors: Boudin Florian, Bougouin Adrien, Daille Béatrice
Publication venue: HAL CCSD
Publication date: 04/07/2016
Audience: International

This article presents the participation of the TALN group at LINA in the Défi Fouille de Textes (DEFT) 2016 text mining challenge. Developed specifically for automatic keyphrase annotation, our system extends an existing state-of-the-art graph-based method (TopicRank) to mimic the professional indexers of digital libraries. Our system ranked third out of a total of five systems.

Modélisation unifiée du document et de son domaine pour une indexation par termes-clés libre et contrôlée

Authors: Boudin Florian, Bougouin Adrien, Daille Béatrice
Publication venue: HAL CCSD
Publication date: 04/07/2016
Audience: International

Unified document and domain-specific model for keyphrase extraction and assignment. This paper focuses on document indexing with keyphrases as performed by professional indexers. From an analysis of indexers working at digital libraries, we propose a graph-based method that combines document information and domain-specific knowledge to perform both keyphrase extraction and assignment (free and controlled indexing). Apart from being able to assign keyphrases that do not necessarily appear within documents, our approach significantly outperforms the state-of-the-art graph-based approach in our experiments.

Indexation d’articles scientifiques. Présentation et résultats du défi fouille de textes DEFT 2016

Authors: Barreaux Sabine, Boudin Florian, Bougouin Adrien, Cram Damien, Daille Béatrice, Hazem Amir
Publication venue: ISTE OpenScience
Publication date: 01/01/2018
Audience: International

Hal-Diderot

HAL Descartes